Over the past few years, many online marketplaces have emerged, and items of all kinds can now be bought easily. However, plenty of these marketplaces are growing on the Dark Web. Even though governments attempt to fight this kind of illegal market, new ones resurface or migrate frequently. On top of that, little is currently known in detail about how these websites operate. For more information on Deep Web and Dark Web marketplaces, please read the article “Mining the Dark Web” [1]; you will also find there an analysis of the Agora Market.
This paper presents research carried out on one of the largest marketplaces on the Internet (especially for drugs), the AlphaBay Dark Web Market. This market caught the attention of governmental agencies after two teenagers, aged 13 and 18, died from overdosing on a powerful synthetic opioid. During our research, AlphaBay and Hansa, another dark web market, were shut down in July 2017 as part of a law enforcement operation by the Federal Bureau of Investigation, the Drug Enforcement Administration and European law enforcement agencies acting through Europol [2] [3].
According to US Attorney General Jeff Sessions, the aim of this action was to caution criminals against thinking that they could evade prosecution by using the dark web. However, it is widely believed that other web markets will take AlphaBay's place. Incidentally, the popularity of AlphaBay can be explained by the shutdowns of Silk Road in 2013 and Silk Road 2.0 in 2014, since AlphaBay was launched in September 2014.
Understanding such illegal markets is crucial to fighting them. The information gathered on these websites makes it possible to identify the ads most wanted by consumers and where they come from. It might therefore be possible to detect the footprint of each seller and thus help governmental agencies identify recurrent sellers hiding behind various identities.
The AlphaBay market will therefore be analysed: its nature, its countries of origin, its main sellers, its predominant items and so on. In a first phase, basic statistics are computed on the database in order to explore the marketplace and point out its trends. Then, experimental results of data mining techniques are discussed.
#----------------------------------------------------------
# Library :
#----------------------------------------------------------
#install.packages("stringr")
#install.packages("units")
#install.packages("ggmap")
#install.packages("plotrix")
#install.packages("rattle")
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("RColorBrewer")
#install.packages("arules")
#install.packages("arulesViz")
#install.packages("e1071")
#install.packages("bnlearn")
# String manipulation
library(stringr)
# Using unit
library(units)
# Plot a map
library(ggmap)
library(plotrix)
# Decision Tree
library(rattle)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
# Association rules
library(arules)
library(arulesViz)
# Bayesian classification, naive algorithm
library(e1071)
# Bayesian Network
library(bnlearn)

#----------------------------------------------------------
# Importation of the data :
#----------------------------------------------------------
data <- as.data.frame(read.csv("../alphaClean.csv"))

Thanks to Sin Wee Lee and Andres Baravalle, data have been collected from the AlphaBay Dark Web Market. Each row of the data represents an ad, and all the significant information can be found in the different columns: title, description, price (in USD), URL link, seller, payment, origin, destination, category, timestamp (that is to say, the date when the ad was collected), creation date of the ad, number of products sold since that date, and a link to the image.
Thus, the first step was to clean the data (remove special characters, switch to lowercase...) and make it machine-readable: that is to say, to find in the title or description of each ad the amount (number of units and mass) of the product being sold. Indeed, at the beginning this information was not given distinctly, so substantial work was done on it in order to make the analyses easier. From this information, the price of one unit dose (1 gram) of the product has been calculated and added.
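This extraction step can be sketched in base R as follows (the ad titles, the prices and the simplified pattern are illustrative assumptions; the real cleaning also parsed the description column and handled more unit spellings):

```r
# Hypothetical ad titles and prices (USD); the pattern below is a simplified
# version of the extraction: capture the mass in grams advertised in the title.
titles <- c("pure mdma 5 g", "cannabis haze 3.5 g")
prices <- c(50, 28)

# Capture the first number followed by a "g" unit marker
# (perl = TRUE enables the lazy ".*?" prefix)
grams <- as.numeric(sub("^.*?([0-9.]+)\\s*g.*$", "\\1", titles, perl = TRUE))

# Price of one unit dose (1 gram) of the product
priceUnitDose <- round(prices / grams, 2)
priceUnitDose
```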
Here is what the database looks like:
#----------------------------------------------------------
# Display of the data :
#----------------------------------------------------------
printdata <- function() {
print.data <- subset (data[19409:19411,], select=-X)
for(i in 2:4){
print.data[,i] <- as.character(print.data[,i])
}
print(print.data)
}
printdata()

Table 1 - Database Sample
As mentioned before, basic statistics were computed first. Let's look at the general distribution and trends of the market.
1. Global view of ads distribution.
#-----------------------------------------------
# Number of ads in the world
#-----------------------------------------------
NumberOfAds <- function() {
# Get rid of unwanted origins like Worldwide and NULL which are not relevant
matching_vector <- c( !str_detect(data$origin, "Worldwide") & !str_detect(data$origin, "NULL"))
sumup <- sort(table(data[matching_vector, "origin"]), decreasing=TRUE)
# Bar plot with the total number of ads in each country
par(las=1)#display yaxis horizontally
par(mar=c(5,6.5,4,0.5)) #give space for yaxis
barp <- barplot(sumup[1:10], main="Main dealer-countries", xlim= c(0,max(sumup[1:10])+1000), xlab="Number of ads",horiz = TRUE, col = rainbow(10), cex.names = 0.8)
# Labels
# Calculation in percentage
sumuppercent<- round(100*(sumup/sum(sumup)), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(sumuppercent)) {
lab[i] <- paste(sumuppercent[[i]], "%", sep=" ")
}
barp <- text(y = barp, x = sumup[1:10], label = lab[1:10], pos=4 , cex = 0.8, col= "black")
# Frame
box(which = "outer", lty = "solid")
}
NumberOfAds()

Figure 1 - 10 Main Dealer Countries
This bar chart shows the 10 main countries in the world by number of ads. As we can see, the United States is the biggest dealer, far ahead of the rest: its number of ads is more than twice that of the second country, the United Kingdom.
Moreover, it is noticeable that most of these countries are economically powerful. For instance, of these ten main countries, five belong to the Group of Seven (G7); only Japan and Italy are missing. The others are also located in powerful areas that trade heavily with other countries.
Furthermore, an interesting thing to point out is that the first four countries are exactly those where the word “AlphaBay” is the most searched for on Google [4].
Figure 2 - AlphaBay Google Searches in the World
2. Now let’s have a look at the distribution of ads per category.
selectDrug <- function(drugName){
matching_vector <- c( (str_detect(data$category, drugName)))
return(matching_vector)
}

#-----------------------------------------------
# Number of ads per categories
#-----------------------------------------------
categ <- function(){
cat <- c()
for(i in 1:length(data$category)) {
cat[i] <- unlist(strsplit(as.character(data$category[i]), "/"))[2]
if(is.na(cat[i])) {cat[i] <- "Other Listings"}
}
tab_cat <- table(cat)
tab_cat <- sort(tab_cat, decreasing=TRUE)
cat.data <- as.data.frame(tab_cat)
radial.plot(cat.data$Freq,labels=cat.data$cat,label.prop=1.1,rp.type="r",start=5.6,clockwise=TRUE,lwd=4,line.col=rainbow(length(tab_cat)),main="Number of ads per categories",radial.labels=c(5,10,15,20))
mtext("In thousands ads",side = 4,line=2,las=1,cex=1.08,font=3)
# Frame
box(which = "inner", lty = "solid")
}
categ()

Figure 3 - Distribution of the Market
RateDrugAd <- function() {
rate <- c()
NbAd <- nrow(data)
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
drug.data <- data[matching_vector,]
NbAdDrug <- nrow(drug.data)
rate[1] <- round((NbAdDrug/NbAd)*100,2)
# Select all "Fraud" ads
matching_vector <- c( str_detect(data$category, "Fraud"))
fraud.data <- data[matching_vector,]
NbAdFraud <- nrow(fraud.data)
rate[2] <- round((NbAdFraud/NbAd)*100,2)
return(rate)
}
rate <- RateDrugAd()

To begin with, it is observable that there are 12 main categories in this web marketplace. The “Drugs & Chemicals” group is the largest one: it represents 45.64 % of the global market.
This is not surprising, since in some countries, such as the United Kingdom and other European countries, it is difficult to obtain some drugs: these countries have strict rules on drug consumption (note that legalisation should not be confused with mere non-prosecution). However, a debate is currently rising in many countries concerning the legalisation of some drugs, such as cannabis. Moreover, at the rate things are going, consumers will probably find other ways of buying drugs, which is why providing information on the current market seems necessary.
It is also worth noting that the second most popular category is “Fraud”, that is to say all the ads regarding impersonation, forged papers and accounts. It represents 13.5 % of the market, much smaller than the drug market. Finally, all other items (digital products, weapons, jewelry...) represent a small share of the marketplace.
The AlphaBay Web Market is a well-known place for dealing drugs; this last chart has proved that. Moreover, its reputation can also be seen in Google statistics [5].
Figure 4 - Evolution of AlphaBay and Dream Market Google searches
This graph shows AlphaBay in blue and Dream Market, another Dark Web market which is still operating, in red. It traces the evolution of Google searches related to these two web markets from 2015 until June 2017. It should be noted that AlphaBay became more and more popular over the last three years, and that just before being shut down by the authorities in July 2017 it was one of the most popular marketplaces. Thus, let's focus on the drug market.
1. Distribution of drugs.
#-----------------------------------------------
# Distribution of Drugs in the market
#-----------------------------------------------
DistributionDrugs <- function() {
#----------------------------
# The most common drugs
#----------------------------
drugs <- c("Cocaine", "Meth", "LSD", "Opioids", "Cannabis", "Steroids", "Ecstasy", "Ketamine", "Heroin", "Shrooms", "Tobacco", "Benzos", "Paraphernalia")
freq <- c()
for(i in 1:length(drugs)){
matching_vector <- selectDrug(drugName=drugs[i]);
sumup <- summary(matching_vector)
freq[i] <- sumup[3] # summary() of a logical gives Mode/FALSE/TRUE counts; index 3 is the TRUE count
}
freq <- as.numeric(freq)
res <- data.frame(drugs, freq)
res <- res[order(res$freq, decreasing = TRUE),]
#----------------------
# Pie Chart
#----------------------
# 1- Labels :
# Calculation in percentage
piepercent<- round(100*res$freq/sum(res$freq), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(piepercent)) {
lab[i] <- paste(piepercent[[i]], "%", sep=" ")
}
# 2- Title :
title <- "Distribution of drugs"
# 3- Colors :
c <- rainbow(length(piepercent))
# 4- Plot :
pie3D(piepercent,labels = lab,labelcex = 1, main = title ,col=c, theta = 0.9, explode = 0.04)
# 5- Legend :
legend(x=-2.3,y=-1.1,res$drugs, cex = 0.9, fill = c,ncol=5,border=NA, xpd=NA)
# Frame
box(which = "inner", lty = "solid")
}
DistributionDrugs()

Figure 5 - Drug Distribution
A large range of drug categories can be discerned. Nevertheless, Cannabis, Opioids and Ecstasy cover more than 50 % of the market. This is not surprising, since cannabis and opioids can be easily found in some countries: it is easy for sellers there to obtain these drugs, while they are in high demand in the remaining countries. Indeed, cannabis is legal in some countries, and some opioids can be obtained with a doctor's prescription. The massive presence of opioids on the market raises the issue of how some of them are prescribed; maybe some countries should rethink their stance on prescription.
The remainder of the market is split among all the other drugs.
2. World distribution of drugs
#---------------------------------------------
# Number of ads of Drugs in the world
#---------------------------------------------
NumberOfAdsDrugs <- function(){
# Get rid of unwanted origins like Worldwide and NULL which are not relevant
matching_vector <- c( str_detect(data$category, "Drugs") & !str_detect(data$origin, "Worldwide") & !str_detect(data$origin, "NULL"))
sumup <- sort(table(data[matching_vector, "origin"]), decreasing=TRUE)
# Bar plot with the total number of ads of Drugs in each country
par(las=1)#display yaxis horizontally
par(mar=c(5,6.5,4,0.5)) #give space for yaxis
barp <- barplot(sumup[1:10], main="Number of ads of Drugs in the World", xlab="Number of ads",xlim = c(0,max(sumup[1:10]+1000)), col = rainbow(10), cex.names = 0.8, horiz = TRUE)
# Calculation in percentage
sumuppercent<- round(100*(sumup/sum(sumup)), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(sumuppercent)) {
lab[i] <- paste(sumuppercent[[i]], "%", sep=" ")
}
barp <- text(y = barp, x = sumup[1:10], label = lab[1:10], pos=4 , cex = 0.8, col= "black")
# Frame
box(which = "outer", lty = "solid")
}
NumberOfAdsDrugs()

Figure 6 - 10 Main Drug Dealer Countries
Once more, the 10 main dealer countries in the world are plotted, but this time by number of drug ads. At first sight, the chart looks like the first one. This is coherent: comparing the ratio of drug ads to the total number of ads shows that these countries mainly deal drugs.
This also matches the second chart, which shows that drugs are the main item on the market. There are a few exceptions: Afghanistan has been replaced by Spain, and Canada has swapped places with China.
So far it is possible to conclude that the drug market is concentrated in Europe and North America.
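The drug-ad ratio used above can be sketched on a few toy rows (the ads below are hypothetical examples; the real computation runs over the whole data frame):

```r
# Toy ads (hypothetical): for each origin country, compute the share of ads
# whose category string contains "Drugs".
ads <- data.frame(
  origin   = c("United States", "United States", "United States",
               "United Kingdom", "United Kingdom"),
  category = c("/Drugs & Chemicals/Cannabis & Hashish/",
               "/Drugs & Chemicals/Stimulants/",
               "/Fraud/Accounts/",
               "/Drugs & Chemicals/Ecstasy/",
               "/Fraud/Personal Information/")
)
is_drug <- grepl("Drugs", ads$category, fixed = TRUE)
# Share of drug ads per origin country, in percent
drug_share <- round(100 * tapply(is_drug, ads$origin, mean), 1)
drug_share
```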
3. Take a global view of the drugs market in Europe with the following map.
MapEurope <- function() {
# Select the ads about drugs and get rid of the irrelevant origin Worldwide
matching_vector <- c( (str_detect(data$category, "Drugs") ) & !str_detect(data$origin, "Worldwide"))
sumup <- sort(summary(data[matching_vector, "origin"]), decreasing=TRUE)
# Read a file containing the latitude and longitude of the "center" of each country
data_country <- read.csv("./Stats/lat_long.csv")
lat_long <- data.frame(Country = data_country$Country , long= data_country$Longitude..average., lat= data_country$Latitude..average.)
# Create a data.frame with the name of the country and its nb of ads
v <- data.frame(name= names(sumup) , amount = sumup)
# Merge v with lat_long in order to have a data frame with Country / NbOfAds / latitude / longitude
data_plot <- merge(v, lat_long, by.x = "name", by.y = "Country" )
# Create a map of EUROPE with circles showing the amount of ads
map <- get_map(location = 'Europe', zoom =4 )
mapPoints <- ggmap(map) + xlab("") + ylab("") + ggtitle("Number of ads of Drugs in Europe")+
geom_point(data = data_plot,aes(x =long, y = lat, size =amount)) +scale_size_continuous(limits=c(0,3000),breaks=c(0,500,1000,1500,2000), range = c(0,13))
# Display
mapPoints
}
MapEurope()

Figure 7 - European Drug Dealers
Circles show the number of ads concerning drugs. The map confirms the previous assumption that there are many drug dealers in Europe, with the United Kingdom, the Netherlands and Germany as the major dealer countries.
It appears that the principal dealer countries are located on the Atlantic coast and own huge harbours with important merchant shipping, whereas in the eastern part there is not much activity. This is probably because dealers use international commercial maritime traffic to dispatch their drugs all around the world. Maritime transport is an increasingly used option, since it allows them to carry large quantities at a time. Drugs can be transported in small, fast boats (go-fast boats between country borders) or in containers on commercial vessels. Thus, significant seaports in Europe, such as Rotterdam in the Netherlands or Antwerp in Belgium, are key points for this type of trafficking. In 2014, “Dutch police estimated that 25-50 % of the cocaine reaching Europe now enters via the port, which handles around 11 million containers a year.” [6]
Let’s now focus more specifically on different countries and study their trends. To do so, the export and import flows of each country have been investigated.
1. United Kingdom exportation
The pie chart below represents the share of each category that the United Kingdom exports. Only the second subcategory has been kept because it appears to be the most relevant, since the first one only gives information on the nature of the ad (for instance “Drugs & Chemicals”).
#-----------------------------------------------
# Importation / Exportation of a country
#-----------------------------------------------
country_Export <- function() {
#-------------------
# Initialization
#-------------------
country <- "United Kingdom"
num <- 0
# Importation / Exportation :
if (num == 0) {
way <- "origin"
txt <- "- Exportation"
} else if (num == 1) {
way <- "destination"
txt <- "- Importation"
}
#------------------
# Analysis
#------------------
# Ads that have the country in the selected column (origin or destination)
matching_vector <- str_detect(data[,way], country)
# List of the categories among those ads
# -> products (categories) exported or imported by the country
country_cat <- data[matching_vector,"category"]
# Handling of these categories
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(country_cat, regex)
# Counting this categories
tab <- table(cat[,3]) #cat[,3] : 2nd category
tab <- sort(tab, decreasing = TRUE) # Sorting (biggest in first)
tab <- tab[1:10] # Taking only the most important
#-----------------
# Pie Chart
#-----------------
# 1- Labels :
# Calculation in percentage
piepercent<- round(100*tab/sum(tab), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(piepercent)) {
lab[i] <- paste(piepercent[[i]], "%", sep=" ")
}
# 2- Title :
title <- paste(country, txt, sep=" ")
# 3- Colors :
c <- rainbow(length(piepercent))
# 4- Plot :
pie3D(piepercent,labels = lab,labelcex = 1, main = title ,col=c, theta = 0.9, explode = 0.04)
# 5- Legend :
legend(x=-2.3,y=-1.1,names(piepercent), cex = 0.8, fill = c,ncol=4,border=NA, xpd=NA)
# Frame
box(which = "inner", lty = "solid")
}
country_Export()

Figure 8 - United Kingdom exportation
Given that most exported items are drugs, it is not surprising that they are the most sold product, as seen before. Once again, this pie chart shows the market's diversity: although a huge share concerns the “Cannabis & Hashish” category, “Stimulants” and other highly dangerous drugs are significantly present as well.
Most European countries follow the same pattern as the United Kingdom, which confirms the previous assumptions.
2. Let’s have a look at the exportations of Afghanistan, which seem different from those of the United Kingdom:
#-----------------------------------------------
# Importation / Exportation of a country
#-----------------------------------------------
country_Export <- function() {
#-------------------
# Initialization
#-------------------
country <- "Afghanistan"
num <- 0
# Importation / Exportation :
if (num == 0) {
way <- "origin"
txt <- "- Exportation"
} else if (num == 1) {
way <- "destination"
txt <- "- Importation"
}
#------------------
# Analysis
#------------------
# Ads that have the country in the selected column (origin or destination)
matching_vector <- str_detect(data[,way], country)
# List of the categories among those ads
# -> products (categories) exported or imported by the country
country_cat <- data[matching_vector,"category"]
# Handling of these categories
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(country_cat, regex)
# Counting this categories
tab <- table(cat[,3]) #cat[,3] : 2nd category
tab <- sort(tab, decreasing = TRUE) # Sorting (biggest in first)
tab <- tab[1:10] # Taking only the most important
#-----------------
# Pie Chart
#-----------------
# 1- Labels :
# Calculation in percentage
piepercent<- round(100*tab/sum(tab), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(piepercent)) {
lab[i] <- paste(piepercent[[i]], "%", sep=" ")
}
# 2- Title :
title <- paste(country, txt, sep=" ")
# 3- Colors :
c <- rainbow(length(piepercent))
# 4- Plot :
pie3D(piepercent,labels = lab,labelcex = 1, main = title ,col=c, theta = 0.9, explode = 0.04)
# 5- Legend :
legend(x=-2,y=-1,names(piepercent), cex = 0.8, fill = c,ncol=3,border=NA, xpd=NA)
# Frame
box(which = "inner", lty = "solid")
}
country_Export()

Figure 9 - Afghanistan exportation
What is surprising is that, unlike most countries, Afghanistan doesn’t really retail drugs on the AlphaBay market. Actually, the vast majority of its exported products are false identities, fraudulent accounts and the like. Afghanistan also deals in electronic devices and software.
3. Let’s compare France’s export and import flows and see if there is a difference between them.
#-----------------------------------------------
# Importation / Exportation of a country
#-----------------------------------------------
Country_Export_Import <- function() {
#-------------------
# Initialization
#-------------------
country <- "France"
#---------------------------
# Analysis - Exportation
#---------------------------
# Country as origin
matching_vector <- str_detect(data[,"origin"], country)
# List of the categories among the rows that have "country" as origin
# -> products (categories) exported by the country
country_cat <- data[matching_vector,"category"]
# Handling of these categories
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(country_cat, regex)
# Counting this categories
tab_exp <- table(cat[,3]) #cat[,3] : 2nd category
tab_exp <- sort(tab_exp, decreasing = TRUE) # Sorting (biggest in first)
tab_exp <- tab_exp[1:10] # Taking only the most important
#---------------------------
# Analysis - Importation
#---------------------------
# Country as destination
matching_vector <- str_detect(data[,"destination"], country)
# List of the categories among the rows that have "country" as destination
# -> products (categories) imported by the country
country_cat <- data[matching_vector,"category"]
# Handling of these categories
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(country_cat, regex)
# Counting this categories
tab_imp <- table(cat[,3]) #cat[,3] : 2nd category
tab_imp <- sort(tab_imp, decreasing = TRUE) # Sorting (biggest in first)
tab_imp <- tab_imp[1:10] # Taking only the most important
#-------------------------
# Analysis - Fusion
#-------------------------
# Transformation in data frame
tab_exp <- as.data.frame(tab_exp)
tab_imp <- as.data.frame(tab_imp)
# Merge the 2 data frames in order to have the same labels
tab <- merge(tab_exp,tab_imp,by.x="Var1",by.y="Var1",all = TRUE)
# Handling of the "NA" value (substitution by 0)
for (j in 2:3) {
for(i in 1:length(tab[,j])){
if(is.na(tab[i,j])) {tab[i,j] <-0}
}
}
#---------------------------
# Pie Chart - Exportation
#---------------------------
# plotting 2 graphics on the same picture
par(mfrow = c(1,2))
# 1- Labels :
# Calculation in percentage
piepercent <- round(100*tab[,2]/sum(tab[,2]), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(piepercent)) {
if(piepercent[[i]] == 0) {lab[i] <- ""}
else {lab[i] <- paste(piepercent[[i]], "%", sep=" ")}
}
# 2- Colors :
c <- rainbow(length(tab[,1]))
# 3- Plot :
pie(piepercent,labels=lab,col=c)
mtext("Exportation",cex=1)
#----------------------------
# Pie Chart - Importation
#----------------------------
# 1- Labels :
# Calculation in percentage
piepercent <- round(100*tab[,3]/sum(tab[,3]), 1)
# round(a,1) : one digit after the comma
lab <- c()
for(i in 1:length(piepercent)) {
if(piepercent[[i]] == 0) {lab[i] <- ""}
else {lab[i] <- paste(piepercent[[i]], "%", sep=" ")}
}
# 2- Plot :
pie(piepercent, labels=lab, col=c)
mtext("Importation",cex=1)
#------------------
# General - Plot
#-----------------
par(oma=c(0,0,1.8,0))
title("France",outer=TRUE)
legend(x=-4,y=-1.1,tab[,1], cex = 0.8, fill=c,ncol=3,border=NA, xpd=NA)
# Frame
box(which = "outer", lty = "solid")
}
Country_Export_Import()

Figure 10 - France exportation and importation
It is noticeable that the two pies differ. The percentages of each category are not equal, and some categories don’t systematically appear in the other pie chart. However, it is clear that the main exported drugs are also imported. Moreover, France imports some drugs that are not local.
Nevertheless, these conclusions should be moderated, since targeting one particular country significantly reduces the amount of information used for the statistics.
After analysing general trends and flows, one interesting topic is market prices. One may ask whether products sold on AlphaBay are cheaper than in the streets.
1. Average prices on the AlphaBay Web Market
Firstly, the typical price of one gram of the most common drugs has been computed (the median is used rather than the mean, as it is less sensitive to outlier ads), and the results below were obtained.
DrugsPrices <- function() {
drugs <- c("Cocaine", "Meth", "Opioids", "Cannabis", "Steroids", "Ecstasy", "Ketamine", "Heroin", "NBOME","Shrooms", "Tobacco", "Benzos", "Paraphernalia")
med <-c()
for(i in 1:length(drugs)){
matching_vector <- selectDrug(drugName = drugs[i]);
med[i] <- median((data[matching_vector, "priceUnitDose"]))
}
priceDrugs <- data.frame(drugs, med);
priceDrugs$med <- round(priceDrugs$med,2)
priceDrugs <- priceDrugs[order(priceDrugs$med, decreasing=TRUE), ]
par(las=1) # Display yaxis horizontally
par(mar=c(4,8,3,2)) # Give space for yaxis
barp <- barplot(priceDrugs$med, main="Average Price of Drugs per Gram", names.arg = priceDrugs$drugs, xlim = c(0,max(priceDrugs$med+100)), cex.names = 0.8, col =rainbow(length(priceDrugs$drugs)), horiz =TRUE)
barp <- text(y = barp, x = priceDrugs$med, label = paste(priceDrugs$med, " $", sep=""), pos=4 , cex = 0.8, col= "Black")
# Frame
box(which = "outer", lty = "solid")
return (priceDrugs)
}
priceDrugs <- DrugsPrices()

Figure 11 - Drug Prices
2. Comparison with the “street”
Finally, some information about street prices has been collected for the same drugs as above, gathering prices from articles and websites. Unfortunately, values for some drugs are missing. Below are the results of these investigations.
#-----------------------------------------
# Prices found in articles
#-----------------------------------------
DrugsPricesDoc <- function(){
cols <- c("Cocaine", "Meth", "Opioids", "Cannabis" , "Steroids", "Ecstasy", "Ketamine", "Heroin", "NBOME","Shrooms", "Tobacco", "Benzos", "Paraphernalia" , "MDMA", "Amphetamine", "Crack", "LSD" , "URL")
ref1 <- c( 35 , 200 , NA , (5.3 + 7.85)/2 , NA , 15 , 25 , 100 , NA , NA , NA , NA , NA , 40 , 5 , NA , NA , "http://www.drugwise.org.uk/how-much-do-drugs-cost/")
ref2 <- c( 67 , NA , NA , 51 , NA , 15 , 32 , 129 , NA , NA , NA , NA , NA , 51 , 15 , 97 , NA , "http://www.telegraph.co.uk/news/uknews/crime/11346133/The-cost-of-street-drugs-in-Britain.html")
ref3 <- c( 110 , 80 , NA , NA , NA , NA , NA , 170 , NA , 5.7 , NA , NA , NA , 150 , NA , NA , 12000 , "http://www.rehabcenter.net/the-average-cost-of-illegal-drugs-on-the-street/ " )
ref4 <- c( 80 , 109 , NA , NA , NA , 19.12 , NA , 91.16 , NA , NA , NA , NA , NA , NA , NA , NA , NA , " http://o.canada.com/business/interactive-what-illegal-drugs-cost-on-the-street-around-the-world")
ref5 <- c( 64 , NA , NA , NA , NA , 20 , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , NA , " http://www.thestudentpocketguide.com/2012/01/student-life/health-and-relationships/facts-about-drugs/")
doc_drugs <- t(data.frame(ref1, ref2, ref3, ref4, ref5))
colnames(doc_drugs) <- cols
# Calculate the mean price of each drug found in the articles
price_doc <- c()
for(i in 1 : length(cols)){
price_doc[i] <- summary(as.numeric(doc_drugs[,i]))[[4]]
}
price_doc.data <- data.frame(cols, price_doc)
# Merge the previous data frame (mean article price of each drug)
# with the AlphaBay price data frame created above
beside_plot <- merge(price_doc.data, priceDrugs, by.x ="cols", by.y ="drugs")
rownames(beside_plot) <- beside_plot[,1]
beside_plot <- beside_plot[,-1]
# Creating the barplot
par(las=1)#display yaxis horizontally
par(mar=c(6,8,4,3)) # Give space for yaxis
b <- barplot(rbind(beside_plot[,1], beside_plot[,2]), main="Average price of Drugs", xlim = c(0, 400),
xlab="Price of Drugs", beside=TRUE, names.arg = rownames(beside_plot), col=1:2, horiz = TRUE, space = c(0,0.4), cex.names = 0.8)
axis(side=1,at=c(50,150,250,350),labels=c(50,150,250,350))
lab <- c("Street", "AlphaBay")
legend("topright",lab,fill=1:2, cex=0.8)
# Frame
box(which = "outer", lty = "solid")
}
DrugsPricesDoc()

Figure 12 - Comparison of Prices
Globally, it appears that street prices are often much higher than those in AlphaBay ads; in a few cases the prices are similar. (Please find all the references used in the last section: [7], [8], [9], [10], [11].)
Secondly, data mining techniques have been applied in order to discover hidden rules and correlations in the database. Another goal is to predict the value of one variable given the others.
Please note that throughout this part, the analysis covers only the drug market.
The first question is how to guess the seller of an ad. To answer it, different data mining methods have been used, especially decision trees and Bayesian classification.
The algorithm has been run on a subset of the database, with ads as rows and the origin, category, seller and price as columns. The aim is to predict who is selling each ad. By training the algorithm on one half of the data, predictions could be made on the other half. Given that most sellers own just a few ads (occasional advertisements), only the main sellers, who best represent the market, were selected; otherwise, the data mining techniques would fail to find rules for them.
To check the efficiency of the algorithm, a measure of accuracy must be calculated. It is obtained by comparing the predictions of the decision tree with the true values.
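On a small made-up example (hypothetical sellers A and B, not taken from the data), this accuracy computation reduces to:

```r
# Toy ground truth and predictions for five test ads (hypothetical)
truth <- factor(c("A", "A", "B", "B", "B"))
pred  <- factor(c("A", "B", "B", "B", "B"))

# Confusion matrix: true sellers in rows, predicted sellers in columns
conf <- table(truth, pred)

# Accuracy = correct predictions (the diagonal) over all test ads
acc <- round(100 * sum(diag(conf)) / sum(conf), 2)
acc  # 80: four of the five test ads are labelled correctly
```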
1. Decision tree
Using the rpart package, which is based on the CART algorithm, a decision tree has been created, from which predictions of the seller could be made. The predictions for the five most significant sellers and the related tree can be found below.
#----------------------------------------------------------------------
# Decision tree - CART algorithm
# Prediction of the seller knowing the price / category / origin
# Plot
#-----------------------------------------------------------------------
Dtsellers <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
dectree.data <- data[matching_vector,]
# Select the columns of the data that are interesting for the tree,
# i.e. removing columns like "id" or "url" that don't give any information
dectree.data <- subset(dectree.data, select=c(origin,category,seller,priceUnitDose))
# subset() keeps only the chosen columns
# Handling of the category column
# Regular expression for spliting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(dectree.data$category, regex)
dectree.data$category <- cat[,3] # keep only the second part
# Handling : seller
tab_sel <- table(dectree.data$seller)
tab_sel <- sort(tab_sel, decreasing=TRUE) # Sorting (biggest in first)
tab_sel <- tab_sel[1:5] # Taking only the most important : main sellers
name_sel <- names(tab_sel)
# New data keeping only the main sellers
dectree.data <-subset(dectree.data, seller %in% name_sel)
# Random rows :
dectree.data <- dectree.data[sample(nrow(dectree.data),nrow(dectree.data),replace=FALSE), ]
#---------------------
# Decision tree
#---------------------
# Factor
dectree.data$seller <- factor(dectree.data$seller)
# Half of the data for making the decision tree
train.data <- dectree.data[1:floor(nrow(dectree.data)/2),]
# Creation of the tree
tree <- rpart(seller ~.,data=train.data, method="class")
# Plot
fancyRpartPlot(tree, sub="")
# Frame
box(which = "outer", lty = "solid")
#--------------------
# Prediction
#--------------------
# The other half for the prediction
pred.data <- dectree.data[(floor(nrow(dectree.data)/2)+1):nrow(dectree.data),]
# Making prediction
pred <- predict(tree,pred.data,type="class")
# Analysis:
# Comparison between the result and the prediction (prediction in columns)
conf <- table(pred.data[,match("seller",names(pred.data))],pred)
# Accuracy
acc <- round((sum(diag(conf)) / sum(conf)*100),2)
print(conf)
cat("\n")
cat(" ")
compTab <- TableCaption(compTab, "Sellers Prediction / Decision Tree method")
Result <- c(acc,compTab)
return(Result)
}
#accDTseller
ResultDT1 <- Dtsellers()
               pred
                ALaurizen jnenfrancis klosterbier rgn ROCKETLABS
  ALaurizen            57           0           0   0          6
  jnenfrancis           0          66           0   4          0
  klosterbier           0           0          69   4          0
  rgn                   0           0           2  69          0
  ROCKETLABS            8           0           0   0         54
Table 2 - Sellers Prediction / Decision Tree method
Figure 13 - Seller Decision Tree
[1] "The accuracy is : 92.92 %"
It is striking that the predictions are very reliable, since the accuracy is very high.
In Table 2, the columns hold the predictions and the rows the real sellers. In other words, the diagonal shows the number of correct predictions and every other value is a mistake made by the algorithm.
The tree gives a good visual overview and a lot of essential information on these five main sellers.
Besides, with quick basic statistics, it is possible to validate this tree and its branches, that is to say how it is split. Let us have a look at these five dealers: the array below contains the main country, main category and median price of their ads.
#-------------------------------------------------------------
# Statistics for decision tree
#-------------------------------------------------------------
DTanalysis <- function() {
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
Drug.data <- data[matching_vector,]
# Select the columns of the data that are relevant for the tree
# i.e. removing columns like "id" or "url" that do not carry any information
Drug.data <- subset(Drug.data, select=c(origin,category,seller,priceUnitDose))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(Drug.data$category, regex)
Drug.data$category <- cat[,3] # keep only the second part
# Handling : seller
tab_sel <- table(Drug.data$seller)
tab_sel <- sort(tab_sel, decreasing=TRUE) # Sorting (biggest in first)
tab_sel <- tab_sel[1:5] # Taking only the 5 main sellers
seller <- names(tab_sel) # Names of the 5 main sellers
# Initialization
ori <- c()
price <- c()
cat <- c()
# Compute summary information for each seller
for(i in 1:5) {
# Select the ads of the i-th seller
matching_vector <- c( str_detect(Drug.data$seller, seller[i]))
sel.data <- Drug.data[matching_vector,]
ori[i] <- names(sort(table(sel.data$origin),decreasing=TRUE)[1]) # First country
cat[i] <-names(sort(table(sel.data$category),decreasing=TRUE)[1]) # First category
price[i] <- round(summary(sel.data$priceUnitDose)[[3]],2) # Median price
}
sel.data <- as.data.frame(seller)
sel.data$origin <- ori
sel.data$category <- cat
sel.data$price <- price
print(sel.data)
}
DTanalysis()
Table 3 - Seller Analyses
Obviously, this information is exactly what appears in the tree. jnenfrancis is identified thanks to the origin United Kingdom. For the sellers dealing mainly from China, that is to say ALaurizen and ROCKETLABS, the split is made on the price (ALaurizen is clearly cheaper). For the other sellers, the algorithm uses the category Cannabis & Hashish.
Predictions have been made here on only five sellers in order to keep the tree readable (otherwise it is too big to be plotted). However, predictions with more sellers can be made while keeping a good accuracy. For instance, with 10 sellers, the results below are obtained.
#----------------------------------------------------------------------
# Decision tree - CART algorithm
# Prediction of the seller knowing the price / category / origin
# Results
#-----------------------------------------------------------------------
Dtsellers2 <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
dectree.data <- data[matching_vector,]
# Select the columns of the data that are relevant for the tree
# i.e. removing columns like "id" or "url" that do not carry any information
dectree.data <- subset(dectree.data, select=c(origin,category,seller,priceUnitDose))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(dectree.data$category, regex)
dectree.data$category <- cat[,3] # keep only the second part
# Handling : seller
tab_sel <- table(dectree.data$seller)
tab_sel <- sort(tab_sel, decreasing=TRUE) # Sorting (biggest in first)
tab_sel <- tab_sel[1:10] # Taking only the most important : main sellers
name_sel <- names(tab_sel)
# New data keeping only the main sellers
dectree.data <-subset(dectree.data, seller %in% name_sel)
# Random rows :
dectree.data <- dectree.data[sample(nrow(dectree.data),nrow(dectree.data),replace=FALSE), ]
#---------------------
# Decision tree
#---------------------
# Factor
dectree.data$seller <- factor(dectree.data$seller)
# Half of the data for making the decision tree
train.data <- dectree.data[1:floor(nrow(dectree.data)/2),]
# Creation of the tree
tree <- rpart(seller ~.,data=train.data, method="class")
#--------------------
# Prediction
#--------------------
# The other half for the prediction
pred.data <- dectree.data[(floor(nrow(dectree.data)/2)+1):nrow(dectree.data),]
# Making prediction
pred <- predict(tree,pred.data,type="class")
# Analysis:
# Comparison between the result and the prediction (prediction in columns)
conf <- table(pred.data[,match("seller",names(pred.data))],pred)
# Accuracy
acc <- round((sum(diag(conf)) / sum(conf)*100),2)
# Display
conf <- as.data.frame(conf)
names(conf) <- c("Sellers","Prediction","Freq")
print(conf)
return(acc)
}
acc1 <- Dtsellers2()
Table 4 - Sellers Prediction / Decision Tree method
[1] "The accuracy is : 91.34 %"
Once more, this array presents the number of correct predictions and mistakes.
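From such a long-format confusion table, per-seller error rates can also be read off. A small sketch on hypothetical frequencies (assuming, as in the output above, that rows are grouped by true seller):

```r
# Hypothetical long-format confusion table (Sellers = true label)
conf <- data.frame(
  Sellers    = c("A", "A", "B", "B"),
  Prediction = c("A", "B", "A", "B"),
  Freq       = c(57, 6, 8, 54)
)
correct <- conf$Freq[conf$Sellers == conf$Prediction]  # diagonal counts
total   <- tapply(conf$Freq, conf$Sellers, sum)        # ads per true seller
recall  <- round(correct / total * 100, 2)             # per-seller recall
recall  # A: 90.48, B: 87.1
```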
2. Bayesian classification - Naive algorithm
Bayesian classification has been used with the same objective as the decision tree: making predictions. Here are the results when running the Naive Bayes algorithm on the same data.
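To make the principle concrete, here is a toy, hand-computed version of the naive Bayes rule on hypothetical ads (the real computation is delegated to naiveBayes() from the e1071 package below): the score of each seller is its prior probability times the per-feature conditional probabilities.

```r
# Toy naive Bayes: P(seller | origin, category) is proportional to
# P(seller) * P(origin | seller) * P(category | seller)
train <- data.frame(
  seller   = c("A", "A", "A", "B"),
  origin   = c("UK", "UK", "US", "US"),
  category = c("Ecstasy", "Opioids", "Ecstasy", "Opioids")
)
# Score each seller for a new ad with origin "US" and category "Ecstasy"
score <- sapply(c("A", "B"), function(s) {
  sub <- train[train$seller == s, ]
  prior <- nrow(sub) / nrow(train)        # P(seller)
  prior * mean(sub$origin == "US") *      # P(origin | seller)
          mean(sub$category == "Ecstasy") # P(category | seller)
})
names(which.max(score))  # "A" wins; "B" scores 0, having never sold Ecstasy
```

Note that seller B gets a zero score because one conditional probability is zero; naiveBayes() can smooth this away with its optional laplace argument.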
#----------------------------------------------------------------------------------
# Bayesian Classification - Naive
# Prediction of the Seller knowing the origin / price / category
#-----------------------------------------------------------------------------------
BayesSellersV1 <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
bayesian.data <- data[matching_vector,]
# Select the columns of the data that are relevant
# i.e. removing columns like "id" or "url" that do not carry any information
bayesian.data <- subset(bayesian.data, select=c(origin,category,seller,priceUnitDose))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(bayesian.data$category, regex)
bayesian.data$category <- cat[,3] # keep only the second part
sellers <- names(sort(table(bayesian.data$seller), decreasing = TRUE))[1:10]
bayesian.data <-subset(bayesian.data, seller %in% sellers)
bayesian.data$seller <- factor(bayesian.data$seller, levels = sellers)
#---------------------
# Bayesian stat
#---------------------
# Random rows :
bayesian.data <- bayesian.data[sample(nrow(bayesian.data),nrow(bayesian.data),replace=FALSE), ]
train.data <- bayesian.data[1:floor(nrow(bayesian.data)/2),]
pred.data <- bayesian.data[(floor(nrow(bayesian.data)/2)+1):nrow(bayesian.data),]
model <- naiveBayes(seller ~ ., data = train.data)
preds <- predict(model, newdata = pred.data)
conf_matrix <- table(preds, pred.data$seller)
acc <- round(sum(diag(conf_matrix)) / sum(conf_matrix)*100, 2)
# Display
conf_matrix <- data.frame(conf_matrix)
names(conf_matrix) <- c("Sellers","Prediction","Freq")
print(conf_matrix)
return(acc)
}
accBayesV1 <- BayesSellersV1()
Table 5 - Sellers Prediction / Bayesian Classification
[1] "The accuracy is : 53.00 %"
The accuracy is 53 %, which is not very good compared to the decision tree. However, one way to improve the predictions is to add new variables that could be relevant, such as the number of products already sold and the creation date of the ad.
#----------------------------------------------------------------------------------
# Bayesian Classification - Naive
# Prediction of the Seller knowing the origin / price / category / products_sold / date creation
#-----------------------------------------------------------------------------------
BayesSellers <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
bayesian.data <- data[matching_vector,]
# Select the columns of the data that are relevant
# i.e. removing columns like "id" or "url" that do not carry any information
bayesian.data <- subset(bayesian.data, select=c(origin,category,seller,priceUnitDose, sold_since, products_sold))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(bayesian.data$category, regex)
bayesian.data$category <- cat[,3] # keep only the second part
sellers <- names(sort(table(bayesian.data$seller), decreasing = TRUE))[1:10]
bayesian.data <-subset(bayesian.data, seller %in% sellers)
bayesian.data$seller <- factor(bayesian.data$seller, levels = sellers)
#---------------------
# Bayesian stat
#---------------------
# Random rows :
bayesian.data <- bayesian.data[sample(nrow(bayesian.data),nrow(bayesian.data),replace=FALSE), ]
train.data <- bayesian.data[1:floor(nrow(bayesian.data)/2),]
pred.data <- bayesian.data[(floor(nrow(bayesian.data)/2)+1):nrow(bayesian.data),]
model <- naiveBayes(seller ~ ., data = train.data)
preds <- predict(model, newdata = pred.data)
conf_matrix <- table(preds, pred.data$seller)
acc <- round(sum(diag(conf_matrix)) / sum(conf_matrix)*100, 2)
# Display
conf_matrix <- data.frame(conf_matrix)
names(conf_matrix) <- c("Sellers","Prediction","Freq")
print(conf_matrix)
return(acc)
}
accBayesV2 <- BayesSellers()
Table 6 - Sellers Prediction / Bayesian Classification
[1] "The accuracy is : 84.63 %"
The results show that the algorithm succeeds in predicting most of the sellers: the accuracy is 84.63 %, which is still a little lower than with the decision tree. However, with more sellers (more than 40, for instance), this algorithm tends to be more accurate than the decision tree method.
Later, in the section Prediction of profitability, it will be discussed how to best exploit these two values: the creation date of the ads and the number of sold products.
Secondly, one can wonder whether there are links between some drugs, that is to say whether some drugs can be clustered.
To do so, a new data frame has been created with one row per seller and one column per sub-category of "Drugs & Chemicals". Each cell is True or False depending on whether the dealer has already sold something in this sub-category. Then the Apriori algorithm for Association Rules has been applied, fixing one drug that must appear in the itemset. Here it is Ecstasy, and the results can be seen below.
#--------------------------------------------------------
# Association Rules - Apriori algorithm
# Guess whether a dealer sells a given drug
#--------------------------------------------------------
AssRSellersCat <- function(){
#------------------------------
# New Data frame for analysis
#------------------------------
# Select all ads of "Drugs & Chemicals"
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
data_drugs <- data[matching_vector, ]
# Handling of the categories
# Regular expression for spliting the categories
regex <- "/(.*)/(.*)/(.*)"
cat_exp <- str_match(data_drugs$category, regex)
data_drugs$category <- cat_exp[,3]
# Get rid of category "Other"
matching_vector <- !c( str_detect(data_drugs$category, "Other"))
data_drugs <- data_drugs[matching_vector, ]
# List all the sellers
sellers <-sort(table(data_drugs$seller), decreasing = TRUE)
sellers <- sellers[names(sellers) != "Null"] # drop the placeholder seller "Null"
sellers <- sellers [1:100]
#List all categories concerning drugs
list_category <- table(data_drugs[,"category"])
list_cat_drugs <- list_category [ list_category != 0]
# Step 1 : initialise a data.frame with the information of the first seller
# Select all categories of the seller
matching_vector <- c( str_detect(data$seller, names(sellers)[1]))
cat_seller <-summary(data.frame(data[matching_vector, "category"]))
# Loop creating a boolean vector telling whether the seller sells items in each category
bool_cat <-c()
bool_vec <-c()
for( i in 1: length(list_cat_drugs)){
bool_vec <- str_detect(cat_seller, names(list_cat_drugs)[i])
bool <- FALSE
for(j in 1:length(bool_vec)){
bool <- bool || bool_vec[j]
}
bool_cat[i] <- bool
}
cat_seller.data <- t(data.frame(bool_cat))
colnames(cat_seller.data) <- names(list_cat_drugs)
#Step 2 : Do the same for the other sellers
for(k in 2 : length(sellers)){
# Select all categories of the seller
matching_vector <- c( str_detect(data$seller, names(sellers)[k]))
cat_seller <-summary(data.frame(data[matching_vector, "category"]))
# Loop creating a boolean vector telling whether the seller sells items in each category
bool_cat <-c()
bool_vec <-c()
for( i in 1: length(list_cat_drugs)){
bool_vec <- str_detect(cat_seller, names(list_cat_drugs)[i])
bool <- FALSE
for(j in 1:length(bool_vec)){
bool <- bool || bool_vec[j]
}
bool_cat[i] <- bool
}
cat_seller.data <- rbind(cat_seller.data,bool_cat)
}
rownames(cat_seller.data)<- names(sellers)
#-------------------------
# Ass Rules
#-------------------------
# Association Rules with rhs containing "Ecstasy" only
rules <- apriori(cat_seller.data,
parameter = list(minlen=2, supp=0.05, conf=0.8),
appearance = list(rhs=c("Ecstasy"),default="lhs"),
control = list(verbose=F))
rules.sorted <- sort(rules, by="lift")
rules.sorted@quality$support <- round(rules.sorted@quality$support, 3)
rules.sorted@quality$confidence <- round(rules.sorted@quality$confidence, 2)
rules.sorted@quality$lift <- round(rules.sorted@quality$lift, 2)
inspect(rules.sorted, linebreak = TRUE)
cat("\n")
cat(" ")
compTab <- TableCaption(compTab, "Drugs links")
# Plot graph of rules
plot(rules.sorted[1:5], method="graph", control=list(type="items"),main ="")
mtext("Association Rules on the product range of sellers" , cex = 1.2)
# Frame
box(which = "outer", lty = "solid")
return(compTab)
}
compTab <- AssRSellersCat()
     lhs                    rhs       support confidence lift
[1] {Cannabis & Hashish,
DMA } => {Ecstasy} 0.09 1.00 2.78
[2] {Dissociatives,
DMA } => {Ecstasy} 0.10 1.00 2.78
[3] {DMA ,
Opioids} => {Ecstasy} 0.11 1.00 2.78
[4] {DMA ,
Stimulants} => {Ecstasy} 0.13 1.00 2.78
[5] {Dissociatives,
DMA ,
Psychedelics} => {Ecstasy} 0.05 1.00 2.78
[6] {DMA ,
Psychedelics,
Stimulants} => {Ecstasy} 0.06 1.00 2.78
[7] {Cannabis & Hashish,
Dissociatives,
DMA } => {Ecstasy} 0.05 1.00 2.78
[8] {Cannabis & Hashish,
DMA ,
Opioids} => {Ecstasy} 0.06 1.00 2.78
[9] {Cannabis & Hashish,
DMA ,
Stimulants} => {Ecstasy} 0.07 1.00 2.78
[10] {Dissociatives,
DMA ,
Opioids} => {Ecstasy} 0.05 1.00 2.78
[11] {Dissociatives,
DMA ,
Stimulants} => {Ecstasy} 0.09 1.00 2.78
[12] {DMA ,
Opioids,
Stimulants} => {Ecstasy} 0.08 1.00 2.78
[13] {Cannabis & Hashish,
Dissociatives,
DMA ,
Stimulants} => {Ecstasy} 0.05 1.00 2.78
[14] {Cannabis & Hashish,
DMA ,
Opioids,
Stimulants} => {Ecstasy} 0.05 1.00 2.78
[15] {Dissociatives,
DMA ,
Opioids,
Stimulants} => {Ecstasy} 0.05 1.00 2.78
[16] {DMA } => {Ecstasy} 0.19 0.95 2.64
[17] {DMA ,
Psychedelics} => {Ecstasy} 0.11 0.92 2.55
[18] {Dissociatives,
Psychedelics} => {Ecstasy} 0.09 0.82 2.27
[19] {Cannabis & Hashish,
Opioids} => {Ecstasy} 0.09 0.82 2.27
[20] {Psychedelics,
Stimulants} => {Ecstasy} 0.12 0.80 2.22
[21] {Cannabis & Hashish,
Dissociatives} => {Ecstasy} 0.08 0.80 2.22
[22] {Cannabis & Hashish,
Opioids,
Stimulants} => {Ecstasy} 0.08 0.80 2.22
Table 7 - Drugs links
Figure 14 - Drugs links
The algorithm succeeds in finding rules in the data frame, which means that some drugs can indeed be clustered. The support lies between 5 % and 15 %, so these itemsets are frequent. Moreover, the confidence is above 80 %: when the itemset on the left is present, the drug on the right is very likely to be present as well.
These rules can be interpreted as follows: sellers often deal in more than one product, and these products can be clustered by type.
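As a quick sanity check of these numbers, lift is the confidence divided by the support of the right-hand side, so the overall share of sellers listing Ecstasy can be recovered from rule [16] above:

```r
# Rule [16]: {DMA} => {Ecstasy} with confidence 0.95 and lift 2.64
conf_rule <- 0.95
lift_rule <- 2.64
supp_ecstasy <- conf_rule / lift_rule  # lift = confidence / support(rhs)
round(supp_ecstasy, 2)                 # about 0.36: ~36% of the sellers list Ecstasy
```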
After trying to predict the seller (cf. Section 4.1), predictions of the origin of the ads have been made using the Decision Tree method and Association Rules.
1. Decision tree
Looking at prices and categories, a decision tree has been created in order to predict the origin. In the same way, the algorithm has been trained on one half of the data, and predictions made on the other half.
#----------------------------------------------------------------------
# Decision tree - CART algorithm
# Prediction of the country knowing the price / category
#-----------------------------------------------------------------------
DTorigin <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
dectree.data <- data[matching_vector,]
# Select the columns of the data that are relevant for the tree
# i.e. removing columns like "id" or "url" that do not carry any information
dectree.data <- subset(dectree.data, select=c(origin,category,priceUnitDose))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(dectree.data$category, regex)
dectree.data$category <- cat[,3] # keep only the second part
# Handling : country
dectree.data <- dectree.data[which(dectree.data$origin != "Worldwide"),]
tab_coun <- table(dectree.data$origin)
tab_coun <- sort(tab_coun, decreasing=TRUE) # Sorting (biggest in first)
tab_coun <- tab_coun[1:5] # Taking only the most important : main countries
name_coun <- names(tab_coun)
# New data keeping only the main countries
dectree.data <-subset(dectree.data, origin %in% name_coun)
# Random rows :
dectree.data <- dectree.data[sample(nrow(dectree.data),nrow(dectree.data),replace=FALSE), ]
#---------------------
# Decision tree
#---------------------
# Factor
dectree.data$origin <- factor(dectree.data$origin)
# Half of the data for making the decision tree
train <- dectree.data[1:floor(nrow(dectree.data)/2),]
# Creation of the tree
tree <- rpart(origin ~.,data=train, method="class")
# Plot
fancyRpartPlot(tree, sub="")
# Frame
box(which = "outer", lty = "solid")
#--------------------
# Prediction
#--------------------
# The other half for the prediction
test <- dectree.data[(floor(nrow(dectree.data)/2)+1):nrow(dectree.data),]
# Making prediction
pred <- predict(tree,test,type="class")
# Analysis:
# Comparison between the result and the prediction (prediction in columns)
conf <- table(test[,match("origin",names(test))],pred)
# Accuracy :
acc <- round((sum(diag(conf)) / sum(conf)*100),2)
print(conf)
cat("\n")
cat(" ")
compTab <- TableCaption(compTab, "Origins Prediction / Decision Tree method")
Result <- c(acc,compTab)
return(Result)
}
ResultDT2 <- DTorigin()
                  pred
                   Australia Germany Netherlands United Kingdom United States
  Australia               97       0          15             41           493
  Germany                 20       0          97             67           469
  Netherlands             44       0         186             19           381
  United Kingdom          23       0         123            221           793
  United States           72       0          48            152          1691
Table 8 - Origins Prediction / Decision Tree method
Figure 15 - Origin Decision Tree
Results do not seem to be very good: the accuracy is 43.45 %, which is lower than previously. It turns out that without the seller, which gives a lot of information on the origin, predictions are not very reliable.
Let us see whether correlations can be found between categories and origins.
2. Rules to determine categories of ads coming from United States
Secondly, the Apriori algorithm has been run with two variables, category and origin, so that a link can be made with the predictions of the decision tree above. One country has been fixed, here United States.
#--------------------------------------------------------
# Association Rules - Apriori algorithm
# Guess if United States is the origin of the ad
#--------------------------------------------------------
AssROriginSellerCat <- function(){
#------------------------------
# New Data frame for analysis
#------------------------------
# Select all ads of "Drugs & Chemicals"
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
data_drugs <- data[matching_vector, ]
# Select some columns
asso.data <- subset(data_drugs, select = c(origin,category))
# Get rid of the first part of the category name "/Drugs & Chemicals/"
asso.data$category <- gsub(pattern = "/Drugs & Chemicals/", replacement = "", asso.data$category)
asso.data$origin <- factor(asso.data$origin)
asso.data$category <-factor(asso.data$category)
# asso.data$seller <- factor(asso.data$seller)
# Association Rules with rhs containing one given country only
rules <- apriori(asso.data,
parameter = list(minlen=2, supp=0.0005, conf=0.5),
appearance = list(rhs=c("origin=United States"),default="lhs"),
control = list(verbose=F))
rules.sorted <- sort(rules, by="lift")
rules.sorted@quality$support <- round(rules.sorted@quality$support, 3)
rules.sorted@quality$confidence <- round(rules.sorted@quality$confidence, 2)
rules.sorted@quality$lift <- round(rules.sorted@quality$lift, 2)
inspect(rules.sorted)
cat("\n")
cat(" ")
compTab <- TableCaption(compTab, "Origin Prediction / Association Rules")
# Plot graph of rules
plot(rules.sorted, method="graph", control=list(type="items"),main ="")
mtext("Association Rules on the category and seller to deduce the country" , cex = 1.2)
# Frame
box(which = "outer", lty = "solid")
return(compTab)
}
compTab <- AssROriginSellerCat()
    lhs                                              rhs                    support confidence lift
[1] {category=Cannabis & Hashish/Concentrates} => {origin=United States} 0.028 0.76 2.95
[2] {category=Cannabis & Hashish/Topicals & Others} => {origin=United States} 0.002 0.68 2.63
[3] {category=Stimulants/Adderal & Vyvanse} => {origin=United States} 0.001 0.67 2.58
[4] {category=Cannabis & Hashish/Edibles} => {origin=United States} 0.023 0.65 2.51
[5] {category=Opioids/Pills} => {origin=United States} 0.020 0.53 2.07
[6] {category=Paraphernalia/Paraphernalia} => {origin=United States} 0.006 0.52 2.01
Table 9 - Origin Prediction / Association Rules
Figure 16 - Rules origin
The results show that an ad of one of the categories on the left is likely to come from United States, with a confidence higher than 50 %. Thus, ads from United States often concern Cannabis & Hashish; this can easily be confirmed by plotting a pie chart of United States exportations, as in the previous section.
It is striking that Cannabis & Hashish appears in the main rule. This can be explained by the legalisation of Cannabis in some states: the sale of these products is easy in United States and may interest people from other countries where they are not legalised.
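A hedged sketch of that check, on a synthetic mini data frame rather than the paper's `data` (which would simply be filtered on origin the same way):

```r
# Pie chart of United States ads by category (synthetic example data)
us.data <- data.frame(
  origin   = rep("United States", 6),
  category = c("Cannabis & Hashish", "Cannabis & Hashish", "Cannabis & Hashish",
               "Opioids", "Stimulants", "Opioids")
)
tab <- sort(table(us.data$category), decreasing = TRUE)
pie(tab, main = "United States ads by category")  # largest slice first
```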
Secondly, it is worth predicting the profitability of an ad, that is to say, given an ad, predicting whether it will sell a lot or not. Each ad has information on the category, origin, seller, price and a profitability rate. This rate is calculated by dividing the number of products sold by the current lifetime of the ad, and multiplying by 30 to obtain a monthly number of sales. A Bayesian Network algorithm has been applied to this new data and the results obtained are below.
#----------------------------------------------------------------------
# Bayesian Network
# with seller / origin / price / category / timestamp / sold_since / product_sold
#-----------------------------------------------------------------------
BayesNet <- function(){
#-----------------
# New Data
#-----------------
# Select all "Drugs & Chemicals" ads
matching_vector <- c( str_detect(data$category, "Drugs & Chemicals"))
bayesian.data <- data[matching_vector,]
# Select the columns of the data that are relevant
# i.e. removing columns like "id" or "url" that do not carry any information
bayesian.data <- subset(bayesian.data, select=c(origin,category,seller,priceUnitDose, products_sold, sold_since, timestamp ))
# Handling : column category
# Regular expression for splitting the categories
regex <- "/(.*)/(.*)/(.*)"
cat <- str_match(bayesian.data$category, regex)
bayesian.data$category <- cat[,3] # keep only the second part
#Get rid of lines with Null as products_sold value
bayesian.data <- bayesian.data[!is.element(bayesian.data$products_sold, "NULL"),]
#Convert products_sold to numeric
bayesian.data$products_sold <- as.numeric(as.character(bayesian.data$products_sold))
#Given timestamp and sold_since calculate the lifetime of the ad
bayesian.data$sold_since <- as.Date(bayesian.data$sold_since)
bayesian.data$timestamp <- as.Date(bayesian.data$timestamp)
bayesian.data$timestamp <- bayesian.data$timestamp - bayesian.data$sold_since
bayesian.data$timestamp <- as.numeric(bayesian.data$timestamp)
# 1 day on the market at least
bayesian.data <- bayesian.data[which(bayesian.data$timestamp > 0),]
#Calculate profitability
bayesian.data$products_sold <- bayesian.data$products_sold / bayesian.data$timestamp * 30
names(bayesian.data)[match("products_sold",names(bayesian.data))] <- "profitability"
#Discretize profitability
nbCategory <- 5
bayesian.data$profitability <- arules::discretize(bayesian.data$profitability, method="frequency", categories = nbCategory)
bayesian.data <- subset(bayesian.data, select= -c(sold_since, timestamp))
#Convert variables to factor
bayesian.data$category <- as.factor(bayesian.data$category)
bayesian.data$seller <- as.factor(bayesian.data$seller)
bayesian.data$origin <- as.factor(bayesian.data$origin)
#Get rid of lines with NA as products_sold value
bayesian.data <- bayesian.data[!is.element(bayesian.data$profitability, NA),]
#---------------------
# Bayesian Network
#---------------------
res <- hc(bayesian.data)
plot(res)
fittedbn <- bn.fit(res, data = bayesian.data)
prob.data <- data.frame(fittedbn$profitability$prob)
colnames(prob.data) <- c("Profitability", "Category", "Probability")
print(prob.data)
compTab <- TableCaption(compTab, "Categories Profitability")
#Handling interval
interv <- levels(bayesian.data$profitability)
interv <- unlist(strsplit(interv, ","))
interv <- gsub(pattern = "[^0-9.]*", replacement = "", interv)
interval <- data.frame(interv[seq(1, length(interv), 2)],interv[seq(2, length(interv), 2)])
colnames(interval) <- c("left", "right")
interval$left <- as.numeric(levels(interval$left))
interval$right <- as.numeric(levels(interval$right))
expectancy <- c()
#Calculate expectancy for each category
for(i in 0:(length(table(bayesian.data$category))-1)){
left <- 0
right <- 0
for(j in 1:nbCategory){
left<-left + interval$left[j] * fittedbn$profitability$prob[i*nbCategory + j]
right<-right + interval$right[j] * fittedbn$profitability$prob[i*nbCategory + j]
}
expectancy[i+1] <- paste("[", round(left,2) , "," , round(right,2) , "]")
}
affichage <-data.frame(names(table(bayesian.data$category)),expectancy)
colnames(affichage) <- c("Category", "Expectancy")
print(affichage)
compTab <- TableCaption(compTab, "Categories Expectancy")
# Frame
box(which = "outer", lty = "solid")
mtext("Conditional dependency between variables" , cex = 1.2,side = 1)
return(compTab)
}
compTab <- BayesNet()
Table 10 - Categories Profitability
Table 11 - Categories Expectancy
Figure 17 - Variable Dependencies
The Bayesian network shows that the profitability is conditionally dependent on the category, that is to say, the category has a significant impact on profitability. The conditional probabilities are shown in the array. It is surprising that price and seller have no impact on profitability.
Furthermore, the expectancy of the profitability given each category has been calculated, so that the most profitable products to sell can be found. They appear to be prescription drugs, steroids and opioids. They may be more profitable because, unlike Cannabis or Cocaine, which can be found more easily on the street, they are not "common products" while still being in demand.
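The interval expectancy computed in BayesNet() can be illustrated on hypothetical numbers: with profitability discretized into intervals, the probability-weighted interval endpoints bracket the expected monthly sales for a category.

```r
# Hypothetical discretized profitability intervals and conditional
# probabilities P(interval | category)
left_ends  <- c(0, 1, 5)
right_ends <- c(1, 5, 30)
probs      <- c(0.5, 0.3, 0.2)
c(lower = sum(probs * left_ends),   # 1.3
  upper = sum(probs * right_ends))  # 8: the expectancy lies in [1.3, 8]
```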
[1] Baravalle A, Lopez MS, Lee SW. " Mining the dark web: Drugs and fake ids ".
[2] Baraniuk C. " BBC - alphabay and hansa dark web markets shut down " 20/07/2017. [Online]. [Accessed : 24/07/2017]. Available: http://www.bbc.co.uk/news/technology-40670010.
[3] Kopan T. " CNN - doj announces takedown of dark web market alphabay " 20/07/2017. [Online]. [Accessed : 24/07/2017]. Available: http://edition.cnn.com/2017/07/20/politics/doj-takes-down-dark-web-marketplace-alphabay/index.html.
[4] " Google trends - explore: AlphaBay ". [Online] [Accessed : 22/07/2017]. Available: https://trends.google.co.uk/trends/explore?date=2014-12-01%202017-07-01&q=alphabay.
[5] " Google trends - explore: AlphaBay, dream market ". [Online] [Accessed : 22/07/2017]. Available: https://trends.google.co.uk/trends/explore?date=2015-01-01%202017-07-01&q=alphabay,Dream%20Market.
[6] " Cocaine - trafficking and supply (eu drug markets report) ". Paragraph: Modes of Transport 25/07/2017 [Online]. [Accessed : 24/07/2017]. Available: http://www.emcdda.europa.eu/publications/eu-drug-markets/2016/online/cocaine/trafficking-and-supply.
[7] " DrugWise - how much do drugs cost? ". [Online] [Accessed : 12/07/2017]. Available: http://www.drugwise.org.uk/how-much-do-drugs-cost/.
[8] " The telegraph - the cost of street drugs in britain. ". [Online] [Accessed : 12/07/2017]. Available: http://www.telegraph.co.uk/news/uknews/crime/11346133/The-cost-of-street-drugs-in-Britain.html.
[9] " RehabCenters.net - the average cost of illegal street drugs. ". [Online] [Accessed : 12/07/2017]. Available: http://www.rehabcenter.net/the-average-cost-of-illegal-drugs-on-the-street/.
[10] " Canada.com - what illegal drugs cost on the street around the world. ". [Online] [Accessed : 12/07/2017]. Available: http://o.canada.com/business/interactive-what-illegal-drugs-cost-on-the-street-around-the-world.
[11] " The student pocket guide - facts about drugs. ". [Online] [Accessed : 12/07/2017]. Available: http://www.thestudentpocketguide.com/2012/01/student-life/health-and-relationships/facts-about-drugs/.
[12] " R markdown from r studio - r markdown tutorial " . [Online] [Accessed : 29/06/2017]. Available: http://rmarkdown.rstudio.com/lesson-1.html.
[13] Tan P-N, Steinback M, Kumar V. " Introduction to data mining ". Addison Wesley; 2006.